Quantum learning without quantum memory

A quantum learning machine for binary classification of qubit states that does not require quantum memory is introduced and shown to perform with the very same error rate as the optimal (programmable) discrimination machine for any size of the training set. At variance with the latter, this machine can be used an arbitrary number of times without retraining. Its required (classical) memory grows only logarithmically with the number of training qubits, while (asymptotically) its excess risk decreases as the inverse of this number, and twice as fast as the excess risk of an "estimate-and-discriminate" machine, which estimates the states of the training qubits and classifies the data qubit with a discrimination protocol tailored to the obtained estimates.

The advent of quantum computation, most particularly the discovery of Shor's algorithm, has not only triggered the urge to find suitable implementations of a quantum computer but has also prompted theorists to devise new techniques to supplement the initially small quantum computing toolbox, of which amplitude amplification (Grover) and fast transforms (Shor) are still the most prominent examples. The mantra is that with such a toolbox, when well supplied, quantum computers will largely outperform classical computers in many of the tasks the latter carry out. Quantum simulators, though still lacking many of the features of a full-fledged quantum computer, perform tasks of a more "quantum nature", which cannot be efficiently carried out by a classical computer. Namely, they have the ability to simulate complex quantum dynamical systems of interest (enabling control of their Hamiltonian parameters). As individual quantum systems play a more prominent role in labs (and, eventually, in everyday life), genuinely quantum tasks will be needed, such as dynamical control, state identification, or state classification. Quantum information techniques are already being developed in order to execute these tasks efficiently.

This letter is concerned with a simple, yet fundamental instance of quantum state classification. A source produces two unknown pure qubit states with equal probability. A human expert (who knows the source specifications, for instance) classifies states produced by this source into two sets of size $n$ and attaches the labels 0 and 1 to them. We view these $2n$ states as a training sample, and we set ourselves to find a universal machine that uses this sample to assign the right label to a new unknown state produced by the same source.

The above can be understood as a supervised quantum learning problem, as has been noticed by Guta and Kotlowski in their recent work [1] (though they use a slightly different setting). Learning theory, more properly named machine learning theory, is a very active and broad field which, roughly speaking, deals with algorithms capable of learning from experience. Its quantum counterpart not only provides improvements over some classical learning problems but also has a wider range of applicability, which includes the problem at hand. Quantum learning also has strong links with quantum control theory and is becoming a significant element of the quantum computing and quantum information processing toolboxes.

The attentive reader will immediately recognize the binary classification problem that we have introduced to be a particular case of the so-called programmable discrimination. An optimal machine for programmable discrimination, which sets an absolute limit on the minimum classification error, consists of a suitable joint measurement on both the $2n$ training qubits and the qubit we wish to label, where the observed outcome determines which of the two labels, 0 or 1, must be assigned. Note that, in principle, this procedure requires keeping the training sample in a quantum memory until the very moment we need to label (or classify) the unknown qubit. Indeed, the explicit constructions of optimal machines for programmable discrimination known so far have this feature. Moreover, as will become clear below, after the joint measurement, no useful information about the training set (TS) is left. These constructions would hence be useless for classifying a second unknown qubit produced by the same source unless another TS were provided. This may be seen as impractical in many situations.

The aim of this letter is to construct a learning machine (LM) that requires far fewer resources while still having the same accuracy as the optimal programmable discrimination machine for any size $2n$ of the TS, not necessarily asymptotically large. All relevant information about the TS is kept in a classical memory, thus classification can be executed any time after the learning process is completed. As an added bonus, once trained, this machine can be subsequently used an arbitrary number of times to classify states produced by the same source.

At this point it should be noted that state estimation and state discrimination are two primitives from which LMs without quantum memory can be naturally assembled. We will refer to these specific constructions as "estimate-and-discriminate" (E&D) machines. The protocol they execute is as follows: by performing, e.g., an optimal covariant measurement on the $n$ qubits in the TS labeled 0, their state $|\psi_0\rangle$ is estimated with some accuracy, and likewise the state $|\psi_1\rangle$ of the other $n$ qubits that carry the label 1 is characterized. This classical information is stored and subsequently used to discriminate an unknown qubit state. It will be shown that the excess risk (i.e., the excess average error over classification when the states $|\psi_0\rangle$ and $|\psi_1\rangle$ are perfectly known) of this protocol is asymptotically twice that of the optimal LM. The fact that the E&D machine is suboptimal means that the kind of information retrieved from the TS and stored in the classical memory of our optimal LM is specific to the classification problem at hand, and that the machine itself is more than the mere assemblage of well known protocols.

Before presenting our results, let us summarize what is known about optimal machines for programmable discrimination. This will also allow us to introduce our notation and conventions. The TS of size $2n$ is given by a state pattern of the form $[\psi_0^{\otimes n}] \otimes [\psi_1^{\otimes n}]$, where the shorthand notation $[\,\cdot\,] \equiv |\cdot\rangle\langle\cdot|$ will be used throughout the letter, and where no knowledge about the actual states $|\psi_0\rangle$ and $|\psi_1\rangle$ is assumed (the figure of merit will be an average over all states of this form). The qubit state that we wish to label (the data qubit) belongs either to the first group (it is $[\psi_0]$) or to the second one (it is $[\psi_1]$). Thus, the optimal machine must discriminate between the two possible states: either $\rho_0 = [\psi_0^{\otimes (n+1)}]_{AB} \otimes [\psi_1^{\otimes n}]_{C}$, in which case it should output the label 0, or $\rho_1 = [\psi_0^{\otimes n}]_{A} \otimes [\psi_1^{\otimes (n+1)}]_{BC}$, in which case the machine should output the label 1. Here and when needed for clarity, we label the three subsystems involved in this problem as A, B and C, where AC is the TS and B is the data qubit. In order to discriminate $\rho_0$ from $\rho_1$, a joint two-outcome measurement, independent of the actual states $|\psi_0\rangle$ and $|\psi_1\rangle$, is performed on all $2n+1$ qubits. Mathematically, it is represented by a positive operator valued measure (POVM) $\mathcal{E} = \{E_0 = E,\; E_1 = \mathbb{1} - E\}$. The minimum average error probability of the labeling/discrimination process is given by $P_e = (1 - \Delta/2)/2$, where $\Delta = 2\max_{E} \int d\psi_0\, d\psi_1\, \mathrm{tr}\,[(\rho_0 - \rho_1)E]$ (the factor 2 makes $\Delta$ coincide with the trace norm below). This average can be turned into an SU(2) group integral and, in turn, it can be readily computed using Schur's lemma to give

$$\Delta = 2\max_{E}\, \mathrm{tr}\,[(\sigma_0 - \sigma_1)E] = \|\sigma_0 - \sigma_1\|_1, \qquad (1)$$

where $\sigma_{0/1}$ are average states defined as $\sigma_0 = \frac{1}{d_n d_{n+1}}\,\mathbb{1}_{n+1} \otimes \mathbb{1}_n$, $\sigma_1 = \frac{1}{d_n d_{n+1}}\,\mathbb{1}_n \otimes \mathbb{1}_{n+1}$, and $\|\cdot\|_1$ is the trace norm. In this letter $\mathbb{1}_m$ stands for the projector on the fully symmetric invariant subspace of $m$ qubits, which has dimension $d_m = m+1$. The maximum in (1) is attained by choosing $E$ to be the projector onto the positive part of $\sigma_0 - \sigma_1$.

The right hand side of (1) can be computed by switching to the total angular momentum basis, $\{|J, M\rangle\}$, where $1/2 \le J \le n + 1/2$ and $-J \le M \le J$ (an additional label may be required to specify the way subsystems couple to give $J$; see below). In this basis the problem simplifies significantly, as it reduces to pure state discrimination [8] on each subspace corresponding to a possible value of the total angular momentum $J$ and magnetic number $M$. By writing the various values of the total angular momentum as $J = k - 1/2$, with $k = 1, \ldots, n+1$ (the fully symmetric subspace, $k = n+1$, does not contribute), the final answer takes the form:

$$P_e = \frac{1}{2} - \frac{1}{d_n^2\, d_{n+1}} \sum_{k=1}^{n} k \sqrt{d_n^2 - k^2}. \qquad (2)$$

The formula above gives an absolute lower bound to the error probability that can be physically attained. We wish to show that this bound can actually be attained by a learning machine that uses a classical register to store all the relevant information obtained in the learning process, regardless of the size, $2n$, of the TS. A first hint that this may be possible is that one can choose the optimal measurement $\mathcal{E}$ to have positive partial transposition with respect to the partition TS/data qubit. Indeed this is a necessary condition for any measurement that consists of a local POVM on the TS whose outcome is fed forward to a second POVM on the data qubit. This class of one-way adaptive measurements can be characterized as:

$$E_0 = \sum_\mu L_\mu \otimes D_\mu, \qquad E_1 = \sum_\mu L_\mu \otimes (\mathbb{1}_1 - D_\mu), \qquad (3)$$

where the positive operators $L_\mu$ ($D_\mu$) act on the Hilbert space of the TS (data qubit we wish to classify), and $\sum_\mu L_\mu = \mathbb{1}_n \otimes \mathbb{1}_n$. The POVM $\mathcal{L} = \{L_\mu\}$ represents the learning process, and the parameter $\mu$, which a priori may be discrete or continuous, encodes the information gathered in the measurement and required at the classification stage. For each possible value of $\mu$, $\mathcal{D}_\mu = \{D_\mu,\, \mathbb{1}_1 - D_\mu\}$ defines the measurement on the data qubit, whose two outcomes represent the discrimination or classification decision. Clearly, the size of the required classical memory will be determined by the information content of the random variable $\mu$.

To gain insight into the form of the optimal $L_\mu$, we trace subsystems AC in (1), which for the LM reads $\Delta^{\rm LM} = 2\max_{\mathcal{L},\{D_\mu\}} \mathrm{tr}\,[(\sigma_0 - \sigma_1)E_0]$. If we write $\Delta^{\rm LM} = \max_{\mathcal{L}} \Delta_{\mathcal{L}}$, we have

$$\Delta_{\mathcal{L}} = \sum_\mu p_\mu \|\rho_0^\mu - \rho_1^\mu\|_1 = \sum_\mu p_\mu\, |\boldsymbol{r}_0^\mu - \boldsymbol{r}_1^\mu|, \qquad (4)$$

where ($\boldsymbol{r}_{0/1}^\mu$) $\rho_{0/1}^\mu = p_\mu^{-1}\, \mathrm{tr}_{AC}(L_\mu \sigma_{0/1})$ are the (Bloch vectors of the) normalized states of qubit B conditioned to the outcome $\mu$ of the $\mathcal{L}$ measurement, $p_\mu = \mathrm{tr}\, L_\mu/d_n^2$ is the probability of getting that outcome, and the maximum over $\{D_\mu\}$ is attained by the set of projectors over the positive parts of $\{\rho_0^\mu - \rho_1^\mu\}$. Note that we may choose the elements of $\mathcal{L}$ to be rank-one, as follows from a direct application of the triangle inequality to (4), i.e., without loss of generality, we can write $L_\mu = p_\mu [\phi^\mu]$, where $|\phi^\mu\rangle$ is normalized so that $\langle\phi^\mu|\phi^\mu\rangle = d_n^2$.

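To gain more intuition, we focus on the simplest case of $n = 1$. In the total angular momentum basis for AC we have $|\phi^\mu\rangle = \sum_{m=-1}^{1} a_m^\mu |1, m\rangle + b^\mu |0, 0\rangle$, with $b^\mu \ge 0$ and $\langle\phi^\mu|\phi^\mu\rangle = 4$. After some algebra we obtain

$$\Delta_{\mathcal{L}} = \frac{1}{3} \sum_\mu p_\mu\, b^\mu \sqrt{(\Re\, a_0^\mu)^2 + \tfrac{1}{2}\,|a_1^\mu - \bar{a}_{-1}^\mu|^2},$$

where the bar stands for complex conjugate. To obtain $\Delta^{\rm LM}$, we have to maximize this expression over all possible values of $a_m^\mu$, $b^\mu$, and $p_\mu$ consistent with $\mathcal{L}$ being a POVM (a resolution of $\mathbb{1}_1 \otimes \mathbb{1}_1$), in particular $\sum_\mu p_\mu |a_m^\mu|^2 = \sum_\mu p_\mu (b^\mu)^2 = 1$. An attainable upper bound to $\Delta_{\mathcal{L}}$ can be derived by applying the Schwarz inequality to the "vectors" $(\sqrt{p_1}\, b^1, \sqrt{p_2}\, b^2, \ldots)$ and $(\sqrt{p_1}\sqrt{(\Re\, a_0^1)^2 + \cdots},\ \sqrt{p_2}\sqrt{(\Re\, a_0^2)^2 + \cdots},\ \ldots)$. After taking into account the POVM conditions the bound reads

$$\Delta_{\mathcal{L}} \le \frac{1}{6} \sqrt{6 + 2\,\Re \sum_\mu p_\mu \left[(a_0^\mu)^2 - 2 a_1^\mu a_{-1}^\mu\right]}.$$

The Schwarz inequality can be applied a second time along a similar line to finally obtain $\Delta_{\mathcal{L}} \le 1/\sqrt{3}$. This bound leads to the same minimal error probability as the optimal discriminator, as can be read off from (2). The attainability conditions can be cast as $b^\mu/b^\nu = [(a_0^\mu)^2 - 2 a_1^\mu a_{-1}^\mu]/[(a_0^\nu)^2 - 2 a_1^\nu a_{-1}^\nu]$ for any pair of outcomes $\mu$ and $\nu$, $a_0^\mu \in \mathbb{R}$ and $a_1^\mu = -\bar{a}_{-1}^\mu$.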



Since both $b^\mu$ and $(a_0^\mu)^2 - 2 a_1^\mu a_{-1}^\mu$ are rotation invariant, the first condition is trivially met if $|\phi^\mu\rangle = U_\mu |\phi^0\rangle$ for a fixed (seed) state $|\phi^0\rangle$, where throughout this letter $U$ stands for an element of the appropriate representation of SU(2), which should be obvious by context (in this case $U = u^{\otimes 2n}$, where $u$ is in the fundamental representation). In short, the POVM $\mathcal{L}$ can be chosen to be covariant. The remaining attainability conditions above are met with an appropriate choice of the seed state $|\phi^0\rangle$. The simplest one is $|\phi^0\rangle = \sqrt{3}\,|1, 0\rangle + |0, 0\rangle$. This state has a well defined magnetic number $m = 0$ and, as a result, the Bloch vectors $\boldsymbol{r}_0^0$, $\boldsymbol{r}_1^0$ of the data qubit conditioned to outcome $\mu = 0$ point in the $\pm z$ direction. Hence $D_0 = |\tfrac{1}{2}, \tfrac{1}{2}\rangle\langle\tfrac{1}{2}, \tfrac{1}{2}| \equiv [\uparrow]$, and (3) becomes

$$E_0 = \sum_\mu p_\mu\, U_\mu \left([\phi^0] \otimes [\uparrow]\right) U_\mu^\dagger \qquad (5)$$



(the element $E_1$ is obtained by replacing $[\uparrow]$ with $|\tfrac{1}{2}, -\tfrac{1}{2}\rangle\langle\tfrac{1}{2}, -\tfrac{1}{2}| \equiv [\downarrow]$). It is known that the minimal covariant POVM of this form has four outcomes, $\mu = 1, 2, 3, 4$, each occurring with equal probability $p_\mu = 1/4$ (2 bits of memory are needed to store the information contained in $\mu$), corresponding to the four unitary transformations (rotations) that take the unit vector $z$ to the vertices of a regular tetrahedron.
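
The $n = 1$ construction can be checked numerically. The following is a minimal sketch and not part of the original derivation: it assumes the basis ordering (↑↑, ↑↓, ↓↑, ↓↓) for qubits A and C, the standard triplet and singlet states, and one particular orientation of the tetrahedron (one vertex on $+z$); the helper names `su2`, `rotate_pair` and `povm_sum` are ours. It verifies that the four rotated copies of the seed state $\sqrt{3}\,|1,0\rangle + |0,0\rangle$, each weighted by $1/4$, resolve the identity on the two training qubits.

```python
import cmath
import math

def su2(theta, phi):
    """SU(2) matrix that takes the +z Bloch vector to polar angles (theta, phi)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -cmath.exp(-1j * phi) * s],
            [cmath.exp(1j * phi) * s, c]]

def rotate_pair(u, psi):
    """Apply u (x) u to a two-qubit vector psi indexed as 2*a + c."""
    out = [0j] * 4
    for a in range(2):
        for c in range(2):
            out[2 * a + c] = sum(u[a][a2] * u[c][c2] * psi[2 * a2 + c2]
                                 for a2 in range(2) for c2 in range(2))
    return out

# Seed state sqrt(3)|1,0> + |0,0>, with |1,0> = (ud + du)/sqrt(2), |0,0> = (ud - du)/sqrt(2).
r3 = math.sqrt(3)
seed = [0.0, (r3 + 1) / math.sqrt(2), (r3 - 1) / math.sqrt(2), 0.0]

# Vertices of a regular tetrahedron (assumed orientation: one vertex at +z).
theta_t = math.acos(-1 / 3)
vertices = [(0.0, 0.0)] + [(theta_t, 2 * math.pi * k / 3) for k in range(3)]

def povm_sum():
    """Sum over the four outcomes of p_mu |phi_mu><phi_mu| with p_mu = 1/4."""
    total = [[0j] * 4 for _ in range(4)]
    for theta, phi in vertices:
        v = rotate_pair(su2(theta, phi), seed)
        for i in range(4):
            for j in range(4):
                total[i][j] += 0.25 * v[i] * v[j].conjugate()
    return total

def max_deviation_from_identity(m):
    return max(abs(m[i][j] - (1 if i == j else 0))
               for i in range(4) for j in range(4))

deviation = max_deviation_from_identity(povm_sum())
memory_bits = math.log2(len(vertices))  # bits needed to store the outcome label
print(deviation, memory_bits)
```

The deviation from the identity should vanish up to rounding error, and four outcomes correspond to 2 bits of classical memory.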



We are now in the position to prove our main result: the obvious generalization of (5) to arbitrary $n$, which requires that $|\phi^0\rangle = \sum_{j=0}^{n} \sqrt{2j+1}\,|j, 0\rangle$ for $\mathcal{L}$ to be a POVM, gives an error probability $P_e^{\rm LM} = (1 - \Delta^{\rm LM}/2)/2$ equal to the minimum error probability $P_e$ in (2) and is, therefore, optimal. The proof goes as follows. From the very definition of error probability we have $P_e^{\rm LM} = (\mathrm{tr}\,\sigma_1 E_0 + \mathrm{tr}\,\sigma_0 E_1)/2 = \{\mathrm{tr}(\mathbb{1}_A \otimes \mathbb{1}_{BC}\,[\phi^0] \otimes [\uparrow]) + \mathrm{tr}(\mathbb{1}_{AB} \otimes \mathbb{1}_C\,[\phi^0] \otimes [\downarrow])\}/(2 d_n d_{n+1})$, where we have used rotational invariance. We can further simplify this expression by writing it as $P_e^{\rm LM} = \left(\|\mathbb{1}_A \otimes \mathbb{1}_{BC}|\phi^0\rangle|\uparrow\rangle\|^2 + \|\mathbb{1}_{AB} \otimes \mathbb{1}_C|\phi^0\rangle|\downarrow\rangle\|^2\right)/(2 d_n d_{n+1})$. To compute the projections inside the norm signs we first write $|\phi^0\rangle|\uparrow\rangle$ ($|\phi^0\rangle|\downarrow\rangle$ will be considered below) in the total angular momentum basis $|J, M\rangle_{(AC)B}$, where the attached subscripts remind us how subsystems A, B and C are both ordered and coupled to give the total angular momentum $J$ (note that a permutation of subsystems, prior to fixing the coupling, can only give rise to a global phase, thus not affecting the value of the norm we wish to compute). This is a trivial task since $|\phi^0\rangle|\uparrow\rangle \equiv |\phi^0\rangle_{AC}|\uparrow\rangle_B$, i.e., subsystems are ordered and coupled as the subscript (AC)B specifies: we just need the Clebsch-Gordan coefficients $\langle j \pm \tfrac{1}{2}, \tfrac{1}{2}\,|\, j, 0; \tfrac{1}{2}, \tfrac{1}{2}\rangle = \pm\sqrt{(j + \tfrac{1}{2} \pm \tfrac{1}{2})/(2j + 1)}$.

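The projector $\mathbb{1}_A \otimes \mathbb{1}_{BC}$, however, is naturally written as $\mathbb{1}_A \otimes \mathbb{1}_{BC} = \sum_{J,M} |J, M\rangle_{A(CB)}\langle J, M|$. This basis differs from that above in the coupling of the subsystems. To compute the projection $\mathbb{1}_A \otimes \mathbb{1}_{BC}|\phi^0\rangle|\uparrow\rangle$ we only have to know the overlaps between the two bases. Wigner's 6j symbols [11] provide this information as a function of the angular momenta of the various subsystems: $j_A = j_C = n/2$, $j_B = 1/2$, $j_{AC} = j$, $j_{CB} = d_n/2$ and $J = j \pm 1/2$. The relevant overlaps are ${}_{A(CB)}\langle j \pm \tfrac{1}{2}, \tfrac{1}{2}\,|\, j \pm \tfrac{1}{2}, \tfrac{1}{2}\rangle_{(AC)B} = \sqrt{n + \tfrac{3}{2} \pm (j + \tfrac{1}{2})}\,/\sqrt{2 d_n}$.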

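Using the Clebsch-Gordan coefficients and the overlaps of the previous paragraphs, it is not difficult to obtain

$$\mathbb{1}_A \otimes \mathbb{1}_{BC}\,|\phi^0\rangle|\uparrow\rangle = \sum_{k=1}^{n+1} \sqrt{k}\; \frac{\sqrt{d_n + k} - \sqrt{d_n - k}}{\sqrt{2 d_n}}\; |k - \tfrac{1}{2}, \tfrac{1}{2}\rangle_{A(CB)}. \qquad (6)$$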



An identical expression can be obtained for $\mathbb{1}_{AB} \otimes \mathbb{1}_C|\phi^0\rangle|\downarrow\rangle$ in the basis $|J, M\rangle_{(BA)C}$. To finish the proof, we compute the norm squared of (6) and divide by $d_n d_{n+1}$. It is easy to check that this gives the error probability of the optimal discriminator in (2). □
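
The last step of the proof can also be checked numerically. The sketch below is ours and rests on two stated assumptions: that the optimal error probability has the form $P_e = 1/2 - \sum_{k} k\sqrt{d_n^2 - k^2}/(d_n^2 d_{n+1})$, and that the projected state has components $\sqrt{k}\,(\sqrt{d_n + k} - \sqrt{d_n - k})/\sqrt{2 d_n}$ for $k = 1, \ldots, n+1$. The function names `p_opt` and `p_lm` are illustrative.

```python
import math

def p_opt(n):
    """Optimal programmable-discrimination error (assumed closed form)."""
    d_n, d_n1 = n + 1, n + 2
    s = math.fsum(k * math.sqrt(d_n * d_n - k * k) for k in range(1, n + 1))
    return 0.5 - s / (d_n * d_n * d_n1)

def p_lm(n):
    """Learning-machine error: squared norm of the projected state over d_n d_{n+1}."""
    d_n, d_n1 = n + 1, n + 2
    norm_sq = math.fsum(
        k * (math.sqrt(d_n + k) - math.sqrt(d_n - k)) ** 2
        for k in range(1, n + 2)
    ) / (2 * d_n)
    return norm_sq / (d_n * d_n1)

# Compare the two expressions for a range of training-set sizes.
max_gap = max(abs(p_lm(n) - p_opt(n)) for n in range(1, 60))
print(max_gap)
```

Expanding the square in `p_lm` gives $k\,(2 d_n - 2\sqrt{d_n^2 - k^2})$ per term, which is why the two functions agree for every $n$ rather than only asymptotically.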

Let us go back to the POVM condition, specifically to the minimum number of unitary transformations needed to ensure that $\{p_\mu U_\mu [\phi^0] U_\mu^\dagger\}$ in (5) is a resolution of the identity for arbitrary $n$. This issue is addressed in [10], where an explicit algorithm for constructing finite POVMs, including the ones we need here, is given. From the results there, we can bound the minimum number of outcomes of $\mathcal{L}$ by $2(n+1)(2n+1)$. This figure is important because its binary logarithm gives an upper bound to the minimum memory required. We see that it grows at most logarithmically with the size of the TS.
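
As a rough illustration of this memory estimate, the following sketch evaluates the bound for a few training-set sizes. It assumes that the outcome label is stored as a fixed-width binary integer; the function names are ours.

```python
def max_outcomes(n):
    """Upper bound 2(n+1)(2n+1) on the minimum number of POVM outcomes."""
    return 2 * (n + 1) * (2 * n + 1)

def memory_bits_upper_bound(n):
    """Smallest number of bits that can label max_outcomes(n) outcomes."""
    return (max_outcomes(n) - 1).bit_length()  # exact ceil(log2(x)) for x >= 1

for n in (1, 10, 100, 1000):
    print(n, max_outcomes(n), memory_bits_upper_bound(n))
```

For $n = 1$ the bound allows 12 outcomes (4 bits), which is consistent with, but looser than, the 4-outcome (2-bit) tetrahedral POVM discussed above.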

E&D machines can be discussed within this very framework, as they are particular instances of LMs. In this case the POVM $\mathcal{L}$ has the form $L_{\alpha i} = M_\alpha \otimes M'_i$, where $\mathcal{M} = \{M_\alpha\}$ and $\mathcal{M}' = \{M'_i\}$ are themselves POVMs on subsystems A and C respectively. The role of $\mathcal{M}$ and $\mathcal{M}'$ is to estimate (optimally) the qubit states in the TS A and C respectively. The measurement on B (the data qubit) now depends on the pair of outcomes of $\mathcal{M}$ and $\mathcal{M}'$: $\mathcal{D}_{\alpha i} = \{D_{\alpha i},\, \mathbb{1}_1 - D_{\alpha i}\}$. It performs standard one-qubit discrimination according to the two pure-state specifications, say, the unit Bloch vectors $\boldsymbol{s}_0^\alpha$ and $\boldsymbol{s}_1^i$, estimated with $\mathcal{M}$ and $\mathcal{M}'$. In the last part of this letter, we wish to show that E&D machines perform worse than the optimal LM. The starting point is the E&D version of Eq. (4), which now becomes

$$\Delta_{\mathcal{L}} = \sum_{\alpha i} p_\alpha\, p'_i\, |\boldsymbol{r}_0^\alpha - \boldsymbol{r}_1^i| \qquad (7)$$

(the notation is self-explanatory), where the Bloch vectors $\boldsymbol{r}_0^\alpha$ and $\boldsymbol{r}_1^i$ of the states conditioned to the outcomes $\alpha$ and $i$ are proportional to $\boldsymbol{s}_0^\alpha$ and $\boldsymbol{s}_1^i$ respectively.

Contrary to what one may expect, POVMs that are optimal, and thus equivalent, for estimation may lead to different minimum error probabilities. In particular, the continuous covariant POVM is outperformed in the problem at hand by those with a finite number of outcomes. Optimal POVMs with few outcomes enforce large angles between the estimates $\boldsymbol{s}_0^\alpha$ and $\boldsymbol{s}_1^i$, and thus between $\boldsymbol{r}_0^\alpha$ and $\boldsymbol{r}_1^i$ ($\pi/2$ in the $n = 1$ example below). This translates into increased discrimination efficiency, as shown by (7), without compromising the quality of the estimation itself. Hence the orientation of $\mathcal{M}$ relative to $\mathcal{M}'$ (which for two continuous POVMs does not even make sense) plays an important role, as does the actual number of outcomes. With an increasing size of the TS, the optimal estimation POVMs also require a larger number of outcomes and the angle between the estimates decreases on average, since they tend to fill the 2-sphere isotropically. Hence, the minimum error probability is expected to approach that of two continuous POVMs. This is supported by numerical calculations. The problem of finding the optimal E&D machine for arbitrary $n$ appears to be a hard one and is currently under investigation. Here we will give the absolute optimal E&D machine for $n = 1$ and, also, we will compute the minimum error probability for both $\mathcal{M}$ and $\mathcal{M}'$ being the continuous POVM that is optimal for estimation. The latter, as mentioned, is expected to attain the optimal E&D error probability asymptotically.

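We can obtain an upper bound on (7) by applying the Schwarz inequality in a similar fashion as we did for the optimal LM. We readily find that $\Delta_{\mathcal{L}} \le \sqrt{\sum_{\alpha i} p_\alpha p'_i\, |\boldsymbol{r}_0^\alpha - \boldsymbol{r}_1^i|^2} = \sqrt{\sum_\alpha p_\alpha |\boldsymbol{r}_0^\alpha|^2 + \sum_i p'_i |\boldsymbol{r}_1^i|^2}$, where we have used that $\sum_\alpha p_\alpha \boldsymbol{r}_0^\alpha = \sum_i p'_i \boldsymbol{r}_1^i = 0$, as follows from the POVM condition on $\mathcal{M}$ and $\mathcal{M}'$. The maximum norm of $\boldsymbol{r}_0^\alpha$ and $\boldsymbol{r}_1^i$ can be easily seen to be bounded by $1/3$ [$n/(n+2)$ for arbitrary $n$]. Thus $\Delta_{\mathcal{L}} \le \sqrt{2}/3 < 1/\sqrt{3} = \Delta^{\rm LM}$. This bound is attained by the choices $M_{\uparrow/\downarrow} = [\uparrow/\downarrow]$ and $M'_{+/-} = [+/-]$, where, as usual, $|\pm\rangle = (|\uparrow\rangle \pm |\downarrow\rangle)/\sqrt{2}$.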
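
The $n = 1$ E&D optimum can be reproduced with a few lines of arithmetic. This is a sketch under the assumptions stated in the text: each of the two outcomes on A (and on C) occurs with probability $1/2$, and the conditional Bloch vectors have length $1/3$ and point along $\pm z$ (for $\mathcal{M}$) and $\pm x$ (for $\mathcal{M}'$). Variable and function names are ours.

```python
import math
from itertools import product

SHRINK = 1 / 3  # Bloch-vector length n/(n+2) at n = 1

# Conditional Bloch vectors of the data qubit for the two measurements.
r0 = [(0.0, 0.0, SHRINK), (0.0, 0.0, -SHRINK)]   # M  = {[up], [down]} on A
r1 = [(SHRINK, 0.0, 0.0), (-SHRINK, 0.0, 0.0)]   # M' = {[+], [-]} on C
p0 = p1 = 0.5                                    # outcome probabilities

# E&D figure of merit: sum over outcome pairs of p * p' * |r0 - r1|.
delta_ed = sum(p0 * p1 * math.dist(a, b) for a, b in product(r0, r1))

def error_probability(delta):
    return (1 - delta / 2) / 2

def excess_risk(delta):
    """Error probability minus 1/6, the average error for known qubit states."""
    return error_probability(delta) - 1 / 6

print(delta_ed, excess_risk(delta_ed))
```

Every pair of conditional Bloch vectors is orthogonal, so each term contributes the same distance $\sqrt{2}/3$, which is how the Schwarz bound is saturated.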

For arbitrary $n$, a simple expression can be derived in the continuous POVM case, $\mathcal{M} = \mathcal{M}' = \{d_n U_s [\phi^0] U_s^\dagger\}_{s \in S^2}$, where $|\phi^0\rangle = |\tfrac{n}{2}, \tfrac{n}{2}\rangle$, $s$ is a unit vector (a point on the 2-sphere $S^2$) and $U_s$ is the representation of the rotation that takes $z$ into $s$. Here $s$ labels the outcome of the measurement and thus plays the role of $\alpha$ and $i$. The states of the data qubit conditioned to outcome $s$ are $\rho_{0/1}^s = (d_n/d_{n+1})\, \mathrm{tr}_{A/C}(U_s [\phi^0] U_s^\dagger\, \mathbb{1}_{AB/BC})$. Using rotational symmetry arguments one has $\rho_{0/1}^s = u_s \{(d_n/d_{n+1})\, \mathrm{tr}_{A/C}([\phi^0]\, \mathbb{1}_{AB/BC})\} u_s^\dagger \equiv u_s\, \rho_{0/1}^z\, u_s^\dagger$. After a simple calculation we obtain $\rho_{0/1}^z = (d_n [\uparrow] + [\downarrow])/d_{n+1}$. This means that the Bloch vector of the data qubit conditioned to outcome $s$ is proportional to $s$ and it is shrunk by a factor $n/d_{n+1}$. Hence, the continuous version of (7) is $\Delta^{\rm E\&D} = (n/d_{n+1}) \int ds\, |z - s| = 4n/(3 d_{n+1})$. Asymptotically, we have $P_e^{\rm E\&D} = 1/6 + 2/(3n) + \ldots$. Therefore, the excess risk, defined here as the difference between the average error probability of a universal discrimination machine and that of the optimal discrimination protocol for known qubit states (1/6), is $R^{\rm E\&D} = 2/(3n) + \ldots$. This is twice the excess risk of the optimal discriminator and the optimal LM, which can be worked out from (2): $R^{\rm LM} = R^{\rm opt} = 1/(3n) + \ldots$. For $n = 1$ we have $R^{\rm E\&D} = (4 - \sqrt{2})/12$, which is already about 14% larger than $R^{\rm LM} = (4 - \sqrt{3})/12$.
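
The asymptotic factor of two can be illustrated numerically. The sketch below is ours: it assumes the closed form $P_e = 1/2 - \sum_k k\sqrt{d_n^2 - k^2}/(d_n^2 d_{n+1})$ for the optimal LM and $\Delta^{\rm E\&D} = 4n/(3 d_{n+1})$ for the continuous E&D machine, and it checks the sphere average $\int ds\,|z - s| = 4/3$ by a simple midpoint rule in $\cos\theta$. All function names are illustrative.

```python
import math

def p_opt(n):
    """Error probability of the optimal LM (assumed closed form)."""
    d_n, d_n1 = n + 1, n + 2
    s = math.fsum(k * math.sqrt(d_n * d_n - k * k) for k in range(1, n + 1))
    return 0.5 - s / (d_n * d_n * d_n1)

def p_ed_continuous(n):
    """Error probability of the E&D machine built from two continuous POVMs."""
    delta = 4 * n / (3 * (n + 2))
    return (1 - delta / 2) / 2

def mean_chord(steps=20000):
    """Average of |z - s| over the unit sphere, uniform in c = cos(theta)."""
    h = 2 / steps
    return h / 2 * math.fsum(
        math.sqrt(2 - 2 * (-1 + (i + 0.5) * h)) for i in range(steps)
    )

def excess_lm(n):
    return p_opt(n) - 1 / 6

def excess_ed(n):
    return p_ed_continuous(n) - 1 / 6

for n in (1, 10, 100, 1000, 10000):
    print(n, n * excess_lm(n), n * excess_ed(n), excess_ed(n) / excess_lm(n))
```

The products $n R$ should approach $1/3$ and $2/3$ respectively as $n$ grows. Note that the $n = 1$ row uses the continuous POVM, so its E&D value is larger than the finite-outcome optimum $(4 - \sqrt{2})/12$ quoted in the text.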

We have presented a supervised quantum learning machine which classifies an unknown qubit after being trained with a number of already classified qubits. Its performance attains the absolute bound given by the optimal programmable discrimination machine with full simultaneous access to all training and data qubits. In contrast to programmable machines, this learning machine does not require a quantum memory and can also be reused without retraining, which makes it much more versatile. Some relevant generalizations of the problem to, e.g., mixed and higher dimensional states, are currently under study. A challenging problem with direct practical applications in quantum control and information processing is the extension to unsupervised machines where the training qubits are unclassified.